Data wrangling (tidyverse and friends)
I apologize in advance, this tutorial requires quite a bit of explaining and context before it can get into the code…. Good data manipulation is an art form that requires the use of many specific tools (functions) and expert data manipulation comes from the integration of these tools together. Therefore it is necessary to have an overview of the tool set before investigating any single tool.
1 Manipulating data sets
Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses , it is necessary to compile very focused, tidy data sets.
Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:
- variables are in columns
- observations are in rows - that is, for univariate data, there should be a separate row for each response observation.
To achieve tidy data, common data preparations include:
- Reshaping and rearranging data
- Merging multiple data sources
- Aggregating data to appropriate spatial/temporal scales
- Transforming data and deriving new variables
- Sorting and reordering data
- Relabelling data
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines and the R code required to generate those data will be presented in static code boxes either underneath or adjacent the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:
(if online)
(if offline and are running this from a local version having first cloned the ReefCloud Workshop github repository
An overview of the example data sets follows.
These data comprise two Responses (Resp1 and Resp2) that are each observed on four occasions (Time) from each of three dosage treatments (Dose) nested within four plots (Plot). Two of the four plots completely excluded something and the other two were controls (did not exclude the thing).
View data
| Treatment | Plot | Dose | Time | Resp1 | Resp2 |
|---|---|---|---|---|---|
| Control | P1 | H | 1 | 8.12 | 3.06 |
| Control | P1 | H | 2 | 20.55 | 25.94 |
| Control | P1 | H | 3 | 27.49 | 29.85 |
| Control | P1 | H | 4 | 44.79 | 25.39 |
| Control | P1 | M | 1 | 20.99 | 20.31 |
| Control | P1 | M | 2 | 37.54 | 17.62 |
| Control | P1 | M | 3 | 61.46 | 98.44 |
| Control | P1 | M | 4 | 82.21 | 160.01 |
| Control | P1 | L | 1 | 31.73 | 21.22 |
| Control | P1 | L | 2 | 59.08 | 37.51 |
| Control | P1 | L | 3 | 94.54 | 119.22 |
| Control | P1 | L | 4 | 121.17 | 116.45 |
| Control | P2 | H | 1 | 8.14 | 23.93 |
| Control | P2 | H | 2 | 13.36 | 28.02 |
| Control | P2 | H | 3 | 33.37 | 37.17 |
| Control | P2 | H | 4 | 39.87 | 38.25 |
| Control | P2 | M | 1 | 19.95 | 19.73 |
| Control | P2 | M | 2 | 42.83 | 40.52 |
| Control | P2 | M | 3 | 62.46 | 4.81 |
| Control | P2 | M | 4 | 81.78 | 136.66 |
| Control | P2 | L | 1 | 32.76 | 30.70 |
| Control | P2 | L | 2 | 62.35 | 123.78 |
| Control | P2 | L | 3 | 90.22 | 113.87 |
| Control | P2 | L | 4 | 114.03 | 76.52 |
| Exclusion | P3 | H | 1 | 21.86 | 23.58 |
| Exclusion | P3 | H | 2 | 39.83 | 28.03 |
| Exclusion | P3 | H | 3 | 59.53 | 21.32 |
| Exclusion | P3 | H | 4 | 75.59 | 90.76 |
| Exclusion | P3 | M | 1 | 38.57 | 30.63 |
| Exclusion | P3 | M | 2 | 81.25 | 83.61 |
| Exclusion | P3 | M | 3 | 124.08 | 124.09 |
| Exclusion | P3 | M | 4 | 159.69 | 112.65 |
| Exclusion | P3 | L | 1 | 61.16 | 39.53 |
| Exclusion | P3 | L | 2 | 119.84 | 110.27 |
| Exclusion | P3 | L | 3 | 175.87 | 286.33 |
| Exclusion | P3 | L | 4 | 238.76 | 54.23 |
| Exclusion | P4 | H | 1 | 18.82 | 28.60 |
| Exclusion | P4 | H | 2 | 39.82 | 39.07 |
| Exclusion | P4 | H | 3 | 63.30 | 93.43 |
| Exclusion | P4 | H | 4 | 82.29 | 60.15 |
| Exclusion | P4 | M | 1 | 39.51 | 45.90 |
| Exclusion | P4 | M | 2 | 79.24 | 88.04 |
| Exclusion | P4 | M | 3 | 122.09 | 84.19 |
| Exclusion | P4 | M | 4 | 161.67 | 256.34 |
| Exclusion | P4 | L | 1 | 57.93 | 85.24 |
| Exclusion | P4 | L | 2 | 117.88 | 167.90 |
| Exclusion | P4 | L | 3 | 181.09 | 314.49 |
| Exclusion | P4 | L | 4 | 242.31 | 304.70 |
Note as this data set is 48 rows long, the majority examples that feature this dataset will only show the first 6 (or 10) rows to save on space. This truncation will occur via the head() function.
These data comprise a Response observed on three occasions (Time.0, Time.1 and Time.2) from each of two (A1 and A2) treatments (Between, gridded or not) nested within four plots (Plot).
View data
| Plot | Between | Time.0 | Time.1 | Time.2 | |
|---|---|---|---|---|---|
| R1 | P1 | A1 | 8 | 14 | 14 |
| R2 | P2 | A1 | 10 | 12 | 11 |
| R3 | P3 | A2 | 7 | 11 | 8 |
| R4 | P4 | A2 | 11 | 9 | 2 |
Similar to dat.1, these data comprise two Responses (Resp1 and Resp2). However, these data have a different hierarchical structure. Each response was measured from one of two Subplots nested within one of six Plots. Within each Subplot each of two treatment levels (B1 and B2) were applied and two of the Plots where exposed to one of the three Between levels.
View data
| Resp1 | Resp2 | Between | Plot | Subplot | Within |
|---|---|---|---|---|---|
| 8 | 17 | A1 | P1 | S1 | B1 |
| 10 | 18 | A1 | P1 | S1 | B2 |
| 7 | 17 | A1 | P1 | S2 | B1 |
| 11 | 21 | A1 | P1 | S2 | B2 |
| 14 | 19 | A2 | P2 | S3 | B1 |
| 12 | 13 | A2 | P2 | S3 | B2 |
| 11 | 24 | A2 | P2 | S4 | B1 |
| 9 | 18 | A2 | P2 | S4 | B2 |
| 14 | 25 | A3 | P3 | S5 | B1 |
| 11 | 18 | A3 | P3 | S5 | B2 |
| 8 | 27 | A3 | P3 | S6 | B1 |
| 2 | 22 | A3 | P3 | S6 | B2 |
| 8 | 17 | A1 | P4 | S7 | B1 |
| 10 | 22 | A1 | P4 | S7 | B2 |
| 7 | 16 | A1 | P4 | S8 | B1 |
| 12 | 13 | A1 | P4 | S8 | B2 |
| 11 | 23 | A2 | P5 | S9 | B1 |
| 12 | 19 | A2 | P5 | S9 | B2 |
| 12 | 23 | A2 | P5 | S10 | B1 |
| 10 | 21 | A2 | P5 | S10 | B2 |
| 3 | 17 | A3 | P6 | S11 | B1 |
| 11 | 16 | A3 | P6 | S11 | B2 |
| 13 | 26 | A3 | P6 | S12 | B1 |
| 7 | 28 | A3 | P6 | S12 | B2 |
The data.d data comprise of a single response (Resp1) collected on different days. The three different components of the dates have been recorded in three different columns corresponding to the year, month and day.
View data
| year | month | day | Resp1 |
|---|---|---|---|
| 2009 | 11 | 09 | 25 |
| 2008 | 05 | 08 | 52 |
| 2009 | 05 | 23 | 18 |
| 2009 | 04 | 15 | 21 |
| 2008 | 10 | 25 | 45 |
| 2008 | 09 | 26 | 40 |
| 2008 | 07 | 05 | 38 |
| 2008 | 11 | 02 | 9 |
| 2009 | 08 | 19 | 0 |
| 2008 | 10 | 03 | 20 |
The data.bio dataset represents a design that is like a cut down version of data in which there are no Within Subplot levels and thus there are also fewer rows.
View data
| Resp1 | Resp2 | Between | Plot | Subplot |
|---|---|---|---|---|
| 8 | 18 | A1 | P1 | S1 |
| 10 | 21 | A1 | P1 | S2 |
| 11 | 23 | A1 | P2 | S4 |
| 14 | 22 | A2 | P3 | S5 |
| 12 | 24 | A2 | P3 | S6 |
| 11 | 23 | A2 | P4 | S7 |
| 9 | 20 | A2 | P4 | S8 |
| 14 | 11 | A3 | P5 | S9 |
| 11 | 22 | A3 | P5 | S10 |
| 8 | 24 | A3 | P6 | S11 |
| 2 | 16 | A3 | P6 | S12 |
The data.chem dataset represents the same design as the data.bio data except that rather than contain two biological response, it has two chemical measurements (Chem1 and Chem2) collected from the same sampling units as the biological responses.
View data
| Chem1 | Chem2 | Between | Plot | Subplot |
|---|---|---|---|---|
| 1.452878 | 0.8858208 | A1 | P1 | S1 |
| 3.266253 | 0.1800177 | A1 | P1 | S2 |
| 1.178652 | 5.0780682 | A1 | P2 | S3 |
| 13.400350 | 1.5762780 | A1 | P2 | S4 |
| 3.779183 | 1.6222430 | A2 | P3 | S5 |
| 1.196657 | 4.2369184 | A2 | P3 | S6 |
| 5.687807 | 2.9859003 | A2 | P4 | S8 |
| 4.834518 | 4.1328919 | A3 | P5 | S9 |
| 2.002931 | 3.6043314 | A3 | P5 | S10 |
| 12.326867 | 1.7763576 | A3 | P6 | S11 |
| 4.014221 | 0.2255188 | A3 | P6 | S12 |
The data.geo dataset represent geographical information about the Plots in the data.bio dataset.
View data
| Plot | LAT | LONG |
|---|---|---|
| P1 | 17.9605 | 145.4326 |
| P2 | 17.5210 | 146.1983 |
| P3 | 17.0011 | 146.3839 |
| P4 | 18.2350 | 146.7934 |
| P5 | 18.9840 | 146.0345 |
| P6 | 20.1154 | 146.4672 |
The tikus dataset is a data.frame version of the data with the same name that is provided as a list within the mvabund package. The data contain a the abundance of 75 coral species from different locations (rep) over six years (time) at Tikus island. The abundance was measured as the length (in cm) of a 10m transect that intersected with the coral species.
View data
Note, these data actually have 60 rows and 77 columns. In order avoid displaying a huge table in this quick view, I have trimmed the view to just the first 10 rows and six columns.
| time | rep | Psammocora contigua | Psammocora digitata | Pocillopora damicornis | Pocillopora verrucosa | |
|---|---|---|---|---|---|---|
| V1 | 81 | 1 | 0 | 0 | 79 | 32 |
| V2 | 81 | 2 | 0 | 0 | 51 | 21 |
| V3 | 81 | 3 | 0 | 0 | 42 | 35 |
| V4 | 81 | 4 | 0 | 0 | 15 | 0 |
| V5 | 81 | 5 | 0 | 0 | 9 | 0 |
| V6 | 81 | 6 | 0 | 0 | 72 | 0 |
| V7 | 81 | 7 | 0 | 0 | 0 | 41 |
| V8 | 81 | 8 | 0 | 0 | 16 | 25 |
| V9 | 81 | 9 | 0 | 0 | 0 | 38 |
| V10 | 81 | 10 | 0 | 0 | 16 | 0 |
The great folks over at Rstudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation Cheat Sheet and Data Tidying Cheat Sheet (mainly just the first page) are useful summaries.
https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf
https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/main/tidyr.pdf
https://github.com/rstudio/cheatsheets/raw/main/factors.pdf
https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf
https://github.com/rstudio/cheatsheets/raw/main/purrr.pdf
https://github.com/rstudio/cheatsheets/raw/main/regex.pdf
https://github.com/rstudio/cheatsheets/raw/main/strings.pdf
https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf